Min(d)ing English language data on the Web: What can Google tell us?

Author

  • Gunnar Bergh
Abstract

1 Introduction

As commonly recognized, the era of modern corpus linguistics is approaching the half-century mark. During the past 50 years, we have witnessed a series of important landmark events in this field, ranging from the early attempts at mechanolinguistics by Juilland and Busa in the 1950s, to the pioneering work on computerized corpora in the 1960s and 1970s, involving first written material in terms of the Brown Corpus and the LOB Corpus, and later spoken material in connection with the London-Lund Corpus; in the 1980s, we have experienced the large-scale corpus projects of Cobuild and the […]

Having now entered the 21st century, it is clear that there are new challenges ahead for the corpus linguist. In terms of standard corpora, for example, we know that the American National Corpus (ANC) is under development, a parallel to the BNC with 100 million words of transatlantic English (e.g. Ide et al. 2002), and there is also a great deal of work going on with sophisticated varieties of learner corpora and multilingual (parallel) corpora (e.g. Botley et al. 2000; Granger 2004). However, the biggest challenge of today is undoubtedly the growing body of text-based information available on the World Wide Web (henceforth the Web). While originally intended as a pure information source only, this material in fact forms the largest store of textual data in existence, and as such it constitutes a tantalizing resource for various linguistic purposes.

Let us look at some initial figures. As regards the size of the material on the Web, a rough estimate indicates that there are currently (December 2004) about eight billion Web pages available (cf. […]), containing perhaps as much as 50 terabytes of text: at a generous average of 10 bytes per word (cf. Kilgarriff and Grefenstette 2003), these figures suggest a body of no less than five trillion (5 000 billion) words in one form or another. Out of this massive multilingual collection of texts and text fragments, it appears that about two thirds are written in English (e.g. Xu 2000), although the proportion of non-English texts seems to have increased in recent years (e.g. Grefenstette and Nioche 2000). This means that there is probably something in the range of 3 000 billion words of English to be found on the Web, forming a virtual English supercorpus ready for use by enterprising linguists in all manner of language research (cf. Bergh et …
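The size estimate in the abstract is simple arithmetic, and a minimal sketch can make the steps explicit. The input figures (50 terabytes, 10 bytes per word, two thirds English) come from the text; the reading of a terabyte as 10^12 bytes and all variable and function names are assumptions made for illustration only.

```python
# Back-of-the-envelope check of the Web-size estimate quoted in the abstract.
# Assumption (not stated in the source): one terabyte = 10**12 bytes.
TEXT_BYTES = 50 * 10**12   # about 50 terabytes of text on the Web
BYTES_PER_WORD = 10        # the "generous" average per word


def estimate_words(text_bytes: int, bytes_per_word: int) -> int:
    """Return the approximate number of words in a body of text."""
    return text_bytes // bytes_per_word


total_words = estimate_words(TEXT_BYTES, BYTES_PER_WORD)

# About two thirds of the material is reported to be in English.
english_words = total_words * 2 // 3

print(f"total words:   {total_words // 10**9:,} billion")
print(f"English words: {english_words // 10**9:,} billion")
```

Under these assumptions the total comes to 5 000 billion words, and the two-thirds share to roughly 3 300 billion, which is consistent with the abstract's rounder figure of "something in the range of 3 000 billion".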


Similar resources

Download Statistics - What Do They Tell Us?: The Example of Research Online, the Open Access Institutional Repository at the University of Wollongong, Australia

A study was undertaken of download and usage statistics for the institutional repository at the University of Wollongong, Australia, over the six-month period January-June 2006. The degree to which research output was made available, via open access, on Internet search engines was quantified. Google was identified as the primary access and referral point, generating 95.8% of the measurable full...


Impact of Online Setting Collaboration through Strategy-Based Instruction on EFL Learners’ Self-efficacy and Oral Skills

This study aimed to investigate the impact of web-based cooperative teaching through strategy-based instruction on EFL learners’ speaking and listening skills. Moreover, the use of cooperative teaching was hypothesized to have impact on the EFL learners’ self-efficacy. To this purpose, the study followed a mixed-methods design by implementing both qualitative and quantitative data gathering pro...


Query Expansion from Wikipedia and Topic Web Crawler on CLIR

In this paper, we report various strategies for query expansion (QE) in the NTCIR-8 IR4QA subtask. We submit the results of twelve runs from the formal run, which include cross-language information retrieval from English to traditional Chinese, from English to simplified Chinese, and from English to Japanese in the official T-run, D-run and DN-run. Our approach uses Google translation and the O...


English Teachers Professional Development Needs for Web Development Skills: Meeting the Challenges of Teaching English Language in the Information Age

Utilizing the resources of the web in educational practices has made instructional processes more efficient and interesting and has made the learning process on the other hand much easier and attractive. With the web, English language teachers now have the option of engaging learners in online (web-based) instructions in addition to the use of conventional classroom instructions or alternativel...


The Syntax and Semantics of the Scheme Programming Language

For example, the syntax rules of the English language tell us that person, tall, told, the, a, me and joke are legal words, and that The tall person told me a joke is a legal sentence, whereas pkrs, shrel and fdadfa are not legal words, and Person tall told a me the is not a legal sentence. The semantic rules tell us what each of the words mean (e.g., what objects the nouns denote and what proc...




Publication date: 2005